parallelism strategy
ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Liu, Zedong, Cheng, Shenggan, Tan, Guangming, You, Yang, Tao, Dingwen
Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components -- combined with complex inference pipelines and heterogeneous workloads -- introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we introduce Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
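To make the modality-group idea concrete, here is a minimal Python sketch of a modality-aware load balancer that routes requests into separate text-only and multimodal groups and periodically reassigns GPUs by queued work. All class and method names, and the proportional rebalancing policy, are illustrative assumptions rather than ElasticMM's actual API.

```python
from dataclasses import dataclass

@dataclass
class ModalityGroup:
    name: str               # e.g. "text-only" or "image+text"
    gpus: int               # GPUs currently assigned to this group
    queued_tokens: int = 0  # rough proxy for pending work

class ModalityAwareBalancer:
    def __init__(self, total_gpus: int, group_names: list[str]):
        self.total_gpus = total_gpus
        share = total_gpus // len(group_names)   # start with an even split
        self.groups = {name: ModalityGroup(name, share) for name in group_names}

    def route(self, has_image: bool, est_tokens: int) -> str:
        """Separate multimodal and text-only traffic into independent groups."""
        name = "image+text" if has_image else "text-only"
        self.groups[name].queued_tokens += est_tokens
        return name

    def rebalance(self) -> None:
        """Reassign GPUs roughly in proportion to queued work (one possible policy)."""
        total = sum(g.queued_tokens for g in self.groups.values()) or 1
        names = list(self.groups)
        for name in names[:-1]:
            g = self.groups[name]
            g.gpus = max(1, round(self.total_gpus * g.queued_tokens / total))
        # Give whatever remains to the last group so the total stays consistent.
        used = sum(self.groups[n].gpus for n in names[:-1])
        self.groups[names[-1]].gpus = max(1, self.total_gpus - used)

balancer = ModalityAwareBalancer(total_gpus=8, group_names=["text-only", "image+text"])
print(balancer.route(has_image=True, est_tokens=2048))        # -> "image+text"
balancer.rebalance()
print({g.name: g.gpus for g in balancer.groups.values()})     # e.g. {'text-only': 1, 'image+text': 7}
```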
MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production
Jin, Chao, Jiang, Ziheng, Bai, Zhihao, Zhong, Zheng, Liu, Juncai, Li, Xiang, Zheng, Ningxin, Wang, Xi, Xie, Cong, Huang, Qi, Heng, Wen, Ma, Yiyuan, Bao, Wenlei, Zheng, Size, Peng, Yanghua, Lin, Haibin, Liu, Xuanzhe, Jin, Xin, Liu, Xin
We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving training efficiency. When training a 352B MoE model on 1,440 NVIDIA Hopper GPUs, MegaScale-MoE achieves a training throughput of 1.41M tokens/s, improving the efficiency by 1.88$\times$ compared to Megatron-LM. We share our operational experience in accelerating MoE training and hope that by offering our insights in system design, this work will motivate future research in MoE systems.
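As a rough illustration of the inter-operator overlap idea, the following Python sketch pipelines a stand-in all-to-all with a stand-in expert FFN so that communication for the next micro-batch runs while the current one is computed. The thread-based "communication" is only a placeholder; this is not MegaScale-MoE's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def all_to_all(chunk):            # stand-in for the MoE dispatch collective
    time.sleep(0.05)
    return chunk

def expert_ffn(chunk):            # stand-in for expert computation
    time.sleep(0.05)
    return [x * 2 for x in chunk]

def pipelined(chunks):
    """Overlap the next chunk's communication with the current chunk's compute."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = comm.submit(all_to_all, chunks[0])
        for nxt in chunks[1:] + [None]:
            ready = pending.result()
            if nxt is not None:
                pending = comm.submit(all_to_all, nxt)   # comm for the next chunk
            results.append(expert_ffn(ready))            # compute overlaps with it
    return results

print(pipelined([[1, 2], [3, 4], [5, 6]]))
```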
Tesserae: Scalable Placement Policies for Deep Learning Workloads
Bian, Song, Agarwal, Saurabh, Mahmood, Md. Tareq, Venkataraman, Shivaram
Training deep learning (DL) models has become a dominant workload in data centers, and improving resource utilization is a key goal of DL cluster schedulers. To do this, schedulers typically incorporate placement policies that govern where jobs are placed on the cluster. Existing placement policies are either designed as ad-hoc heuristics or incorporated as constraints within a complex optimization problem, and thus suffer from either suboptimal performance or poor scalability. Our key insight is that many placement constraints can be formulated as graph matching problems, and based on that we design novel placement policies for minimizing job migration overheads and for job packing. We integrate these policies into Tesserae and describe how our design leads to a scalable and effective GPU cluster scheduler. Our experimental results show that Tesserae improves average job completion time (JCT) by up to 1.62x and makespan by up to 1.15x compared with existing schedulers.
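A minimal sketch of the graph-matching view, assuming a toy per-job/per-node cost matrix and a fixed migration penalty (both invented for illustration, not Tesserae's actual model), using SciPy's min-cost bipartite matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

jobs = ["jobA", "jobB", "jobC"]
nodes = ["node0", "node1", "node2"]
current = {"jobA": "node1", "jobB": "node0", "jobC": None}  # None = not yet placed

MIGRATION_PENALTY = 10.0
base_cost = np.array([            # e.g. expected slowdown of job i on node j
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 2.0],
    [3.0, 2.0, 1.0],
])

cost = base_cost.copy()
for i, job in enumerate(jobs):
    for j, node in enumerate(nodes):
        if current[job] is not None and current[job] != node:
            cost[i, j] += MIGRATION_PENALTY   # discourage moving running jobs

rows, cols = linear_sum_assignment(cost)      # min-cost perfect matching
for i, j in zip(rows, cols):
    print(f"{jobs[i]} -> {nodes[j]}")
```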
Two-dimensional Sparse Parallelism for Large Scale Deep Learning Recommendation Model Training
Zhang, Xin, Zhu, Quanyu, Xu, Liangbei, Huda, Zain, Zhou, Wang, Fang, Jin, van der Staay, Dennis, Hu, Yuxi, Nie, Jade, Yang, Jiyan, Yang, Chunzhi
The increasing complexity of deep learning recommendation models (DLRM) has led to a growing need for large-scale distributed systems that can efficiently train on vast amounts of data. In DLRM, the sparse embedding table is a crucial component for managing sparse categorical features. Typically, these tables in industrial DLRMs contain trillions of parameters, necessitating model parallelism strategies to address memory constraints. However, as training systems expand to massive numbers of GPUs, the traditional fully-sharded parallelism strategies for embedding tables pose significant scalability challenges, including imbalance and straggler issues, intensive lookup communication, and heavy embedding activation memory. To overcome these limitations, we propose a novel two-dimensional sparse parallelism approach. Rather than fully sharding tables across all GPUs, our solution introduces data parallelism on top of model parallelism. This enables efficient all-to-all communication and reduces peak memory consumption. Additionally, we have developed the momentum-scaled row-wise AdaGrad algorithm to mitigate performance losses associated with the shift in training paradigms. Our extensive experiments demonstrate that the proposed approach significantly enhances training efficiency while maintaining model performance parity. It achieves nearly linear training speed scaling up to 4K GPUs, setting a new state-of-the-art benchmark for recommendation model training.
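The two-dimensional layout can be sketched as follows: shard each table within a small model-parallel group and replicate that group data-parallel, so all-to-all lookups stay within a group while gradient synchronization happens across replicas. The group sizes and grouping order below are assumptions for illustration, not the paper's exact configuration.

```python
def build_2d_groups(world_size: int, shard_group_size: int):
    """Return (model-parallel groups, data-parallel groups) of global ranks."""
    assert world_size % shard_group_size == 0
    replicas = world_size // shard_group_size
    # Each model-parallel group holds one full copy of the sharded tables.
    mp_groups = [list(range(r * shard_group_size, (r + 1) * shard_group_size))
                 for r in range(replicas)]
    # Ranks holding the same table shard form a data-parallel group and
    # synchronize that shard's gradients across replicas.
    dp_groups = [[r * shard_group_size + s for r in range(replicas)]
                 for s in range(shard_group_size)]
    return mp_groups, dp_groups

mp, dp = build_2d_groups(world_size=16, shard_group_size=4)
print("model-parallel groups:", mp)   # embedding all-to-all stays within these 4 ranks
print("data-parallel groups:", dp)    # replicas of the same shard sync here
```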
MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core
Liu, Dennis, Yan, Zijie, Yao, Xin, Liu, Tong, Korthikanti, Vijay, Wu, Evan, Fan, Shiqing, Deng, Gao, Bai, Hongxiao, Chang, Jianbin, Aithal, Ashwath, Andersch, Michael, Shoeybi, Mohammad, Yao, Jiajie, Zhou, Chandler, Wu, David, Li, Xipeng, Yang, June
Mixture of Experts (MoE) models enhance neural network scalability by dynamically selecting relevant experts per input token, enabling larger model sizes while maintaining manageable computation costs. However, efficient training of large-scale MoE models across thousands of GPUs presents significant challenges due to limitations in existing parallelism strategies. We introduce an end-to-end training framework for large-scale MoE models that utilizes five-dimensional hybrid parallelism: Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallelism, and Pipeline Parallelism. Central to our approach is MoE Parallel Folding, a novel strategy that decouples the parallelization of attention and MoE layers in Transformer models, allowing each layer type to adopt optimal parallel configurations. Additionally, we develop a flexible token-level dispatcher that supports both token-dropping and token-dropless MoE training across all five dimensions of parallelism. This dispatcher accommodates dynamic tensor shapes and coordinates different parallelism schemes for Attention and MoE layers, facilitating complex parallelism implementations. Our experiments demonstrate significant improvements in training efficiency and scalability. We achieve up to 49.3% Model Flops Utilization (MFU) for the Mixtral 8x22B model and 39.0% MFU for the Qwen2-57B-A14B model on H100 GPUs, outperforming existing methods. The framework scales efficiently up to 1,024 GPUs and maintains high performance with sequence lengths up to 128K tokens, validating its effectiveness for large-scale MoE model training. The code is available in Megatron-Core.
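A rough sketch of the folding idea: attention and MoE layers each map their own parallel dimensions onto the same set of GPUs independently, so a rank can have different coordinates in the two layouts. The dimension sizes and grouping order below are assumptions, not Megatron-Core's actual rank-ordering logic.

```python
import itertools
import math

def layout(world_size: int, dims: dict[str, int]) -> dict[int, dict[str, int]]:
    """Assign each rank a coordinate in the given parallelism dimensions."""
    assert world_size == math.prod(dims.values())
    coords = itertools.product(*[range(size) for size in dims.values()])
    return {rank: dict(zip(dims, c)) for rank, c in enumerate(coords)}

WORLD = 16
# Attention layers: tensor x context x data parallelism.
attn_map = layout(WORLD, {"tp": 2, "cp": 2, "dp": 4})
# MoE layers, folded onto the same 16 GPUs: expert x expert-tensor x data.
moe_map = layout(WORLD, {"ep": 4, "etp": 1, "dp": 4})

print("rank 5, attention coords:", attn_map[5])
print("rank 5, MoE coords:", moe_map[5])
```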
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Liang, Mingyu, Kassa, Hiwot Tadese, Fu, Wenyin, Coutinho, Brian, Feng, Louis, Delimitrou, Christina
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model's behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
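As a toy illustration of trace-driven replay, the sketch below sums per-stream operator durations and takes the slowest stream as the estimated step time; Lumos's actual trace format, dependency tracking, and overlap modeling are considerably richer.

```python
from collections import defaultdict

# Hypothetical trace records of (stream, op name, duration in microseconds).
trace = [
    ("compute", "attention_fwd", 120.0),
    ("comm",    "all_gather",     80.0),
    ("compute", "mlp_fwd",       150.0),
    ("comm",    "reduce_scatter", 90.0),
]

def replay(records):
    """Advance a per-stream clock; streams are assumed to run concurrently."""
    stream_time = defaultdict(float)
    for stream, _op, dur in records:
        stream_time[stream] += dur
    return dict(stream_time), max(stream_time.values())

per_stream, wall = replay(trace)
print(per_stream)                          # {'compute': 270.0, 'comm': 170.0}
print("estimated step time (us):", wall)   # slowest stream bounds the step
```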
ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
Ge, Hao, Feng, Junda, Huang, Qi, Fu, Fangcheng, Nie, Xiaonan, Zuo, Lei, Lin, Haibin, Cui, Bin, Liu, Xin
Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal and establish static communication groups that organize the devices into a static mesh (e.g., a 2D mesh). However, the sequences used for LLM training typically vary in length, whether they come from text, multimodal data, or reinforcement learning. The mismatch between data heterogeneity and the static mesh causes redundant communication and imbalanced computation, degrading training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences through data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences via selective offloading. Besides, we develop a balance scheduler to mitigate the imbalanced computation through parallelism-aware data assignment. We evaluate ByteScale with model sizes ranging from 7B to 141B and context lengths from 256K to 2048K on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
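As a toy illustration of parallelism-aware data assignment, the sketch below packs variable-length sequences onto data-parallel groups longest-first, balancing a quadratic attention-cost proxy; the cost model and greedy policy are assumptions for illustration, not ByteScale's balance scheduler.

```python
import heapq

def assign(seq_lens, num_groups):
    """Longest-first greedy assignment to the currently least-loaded group."""
    heap = [(0.0, g, []) for g in range(num_groups)]   # (load, group id, seqs)
    heapq.heapify(heap)
    for length in sorted(seq_lens, reverse=True):
        load, g, seqs = heapq.heappop(heap)
        cost = length ** 2          # attention cost grows quadratically with length
        heapq.heappush(heap, (load + cost, g, seqs + [length]))
    return sorted(heap, key=lambda entry: entry[1])

for load, group, seqs in assign([2048, 512, 512, 256, 8192, 1024], num_groups=2):
    print(f"group {group}: load={load:.0f} seqs={seqs}")
```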
Astra: Efficient and Money-saving Automatic Parallel Strategies Search on Heterogeneous GPUs
Wang, Peiran, Li, Haibing, Fu, Haohan, Li, Shiyong, Wang, Yanpeng, Shen, Dou
In this paper, we introduce Astra, an efficient and money-saving framework for automatic parallel strategy search on heterogeneous GPUs. First, Astra searches for the efficiency-optimal parallel strategy over both the GPU configuration space (GPU types and counts) and the parallel parameter space. Second, Astra supports heterogeneous GPUs by mathematically modeling the time consumption of heterogeneous training. Finally, Astra is the first to incorporate a money-saving objective into automatic parallel strategy search. Experimental results demonstrate that Astra achieves better throughput than expert-designed strategies. Its search completes in 1.27 seconds on average in a single-GPU setting and in under 1.35 minutes in a heterogeneous-GPU setting, with an accuracy of over 95%.
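A minimal sketch of such a search loop: enumerate GPU types, GPU counts, and parallel degrees, estimate step time with a cost model, and keep the cheapest feasible plan. The catalog, prices, and time model below are invented placeholders, not Astra's formulation.

```python
from itertools import product

# Hypothetical catalog: peak compute and hourly price per GPU type.
GPU_CATALOG = {"A800": {"tflops": 312, "usd_per_hr": 1.9},
               "H800": {"tflops": 989, "usd_per_hr": 3.5}}

def estimate_step_seconds(gpu, num_gpus, tp, pp):
    compute = 1.0e4 / (GPU_CATALOG[gpu]["tflops"] * num_gpus)   # toy compute term
    comm = 0.002 * (tp - 1) + 0.001 * (pp - 1)                  # toy comm/bubble term
    return compute + comm

best = None
for gpu, num_gpus, tp, pp in product(GPU_CATALOG, [8, 16, 32], [1, 2, 4, 8], [1, 2, 4]):
    if tp * pp > num_gpus:          # skip infeasible parallel degrees
        continue
    step = estimate_step_seconds(gpu, num_gpus, tp, pp)
    dollars = step / 3600 * num_gpus * GPU_CATALOG[gpu]["usd_per_hr"]
    if best is None or dollars < best[0]:
        best = (dollars, gpu, num_gpus, tp, pp, step)

print("cheapest per-step plan:", best)
```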
LLM360 K2: Building a 65B 360-Open-Source Large Language Model from Scratch
Liu, Zhengzhong, Tan, Bowen, Wang, Hongyi, Neiswanger, Willie, Tao, Tianhua, Li, Haonan, Koto, Fajri, Wang, Yuqi, Sun, Suqi, Pangarkar, Omkar, Fan, Richard, Gu, Yi, Miller, Victor, Ma, Liqun, Tang, Liping, Ranjan, Nikhil, Zhuang, Yonghao, He, Guowei, Wang, Renxi, Deng, Mingkai, Algayres, Robin, Li, Yuanzhi, Shen, Zhiqiang, Nakov, Preslav, Xing, Eric
We detail the training of the LLM360 K2-65B model, scaling up our 360-degree OPEN SOURCE approach to the largest and most powerful models under project LLM360. While open-source LLMs continue to advance, the answer to "How are the largest LLMs trained?" remains unclear within the community. The implementation details for such high-capacity models are often protected due to business considerations associated with their high cost. This lack of transparency prevents LLM researchers from leveraging valuable insights from prior experience, e.g., "What are the best practices for addressing loss spikes?" The LLM360 K2 project addresses this gap by providing full transparency and access to resources accumulated during the training of LLMs at the largest scale. This report highlights key elements of the K2 project, including our first model, K2 DIAMOND, a 65 billion-parameter LLM that surpasses LLaMA-65B and rivals LLaMA2-70B, while requiring fewer FLOPs and tokens. We detail the implementation steps and present a longitudinal analysis of K2 DIAMOND's capabilities throughout its training process. We also outline ongoing projects such as TXT360, setting the stage for future models in the series. By offering previously unavailable resources, the K2 project also resonates with the 360-degree OPEN SOURCE principles of transparency, reproducibility, and accessibility, which we believe are vital in the era of resource-intensive AI research.
Data-Centric and Heterogeneity-Adaptive Sequence Parallelism for Efficient LLM Training
Wang, Yujie, Wang, Shiju, Zhu, Shenhan, Fu, Fangcheng, Liu, Xinyi, Xiao, Xuefeng, Li, Huixia, Li, Jiashi, Wu, Faming, Cui, Bin
Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long-context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity. In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.
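A small sketch of the linear-programming view: for each sequence, choose a scattering width to minimize total estimated time under a device budget, using the LP relaxation of the 0/1 choice. The cost model, budget, and rounding are made-up illustrations, not the paper's actual formulation or solver.

```python
import numpy as np
from scipy.optimize import linprog

seq_lens = [8192, 2048, 512]          # three sequences in one training step
strategies = [2, 4, 8]                # candidate scatter widths (devices used)
# Toy per-sequence cost of each strategy: compute shrinks, comm overhead grows.
cost = np.array([[l / d + 0.1 * d for d in strategies] for l in seq_lens])

n, m = cost.shape                      # n sequences, m strategies
c = cost.flatten()                     # objective: total estimated time

# Each sequence picks exactly one strategy (LP relaxation of the 0/1 choice).
A_eq = np.zeros((n, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1
b_eq = np.ones(n)

# Total devices used across sequences must fit the budget.
A_ub = np.tile(strategies, n).reshape(1, -1).astype(float)
b_ub = np.array([16.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, 1)] * (n * m), method="highs")
choice = res.x.reshape(n, m)
for i, row in enumerate(choice):       # crude rounding of the relaxed solution
    print(f"seq {seq_lens[i]}: scatter across {strategies[int(row.argmax())]} devices")
```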